Agentic artificial intelligence represents a major shift from traditional, stateless chatbots to intelligent systems capable of reasoning over extended time horizons, using tools, and retaining contextual memory across interactions. While this evolution unlocks new capabilities, it also introduces a fundamental infrastructure challenge: memory scalability.
As large language models expand to trillions of parameters and support context windows that span millions of tokens, the cost of maintaining inference memory is rising faster than compute performance is improving. For enterprises deploying agentic AI at scale, memory — not processing power — is rapidly becoming the primary bottleneck.
Why memory is the new constraint in agentic AI
Modern transformer-based models rely on a mechanism known as the Key-Value (KV) cache to function efficiently. Instead of recomputing the entire conversation or task history for every token generated, the model stores intermediate states that allow it to continue reasoning from prior context.
In agentic workflows, this KV cache becomes more than a short-lived buffer. It effectively acts as long-term working memory across tools, sessions, and decision chains. As sequence length grows, KV cache size increases linearly, placing immense pressure on existing memory hierarchies.
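To make that linear growth concrete, the short sketch below estimates the KV cache footprint for a hypothetical large model. The layer count, head sizes and FP16 precision are illustrative assumptions rather than figures from any specific product.

```python
def kv_cache_bytes(seq_len: int, layers: int, kv_heads: int,
                   head_dim: int, bytes_per_elem: int = 2,
                   batch: int = 1) -> int:
    """Approximate KV cache size: two tensors (K and V) per layer,
    each storing kv_heads * head_dim values per token."""
    return 2 * layers * kv_heads * head_dim * bytes_per_elem * seq_len * batch

# Illustrative 70B-class configuration with FP16 cache entries (assumed values).
size = kv_cache_bytes(seq_len=128_000, layers=80, kv_heads=8, head_dim=128)
print(f"~{size / 1e9:.0f} GB of KV cache for one 128k-token context")
```

Doubling the sequence length doubles the footprint, which is why long-running agentic sessions outgrow GPU HBM so quickly.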
The challenge is compounded by current infrastructure constraints. Organisations must choose between:
GPU High Bandwidth Memory (HBM) — extremely fast but scarce and expensive, or
General-purpose system or shared storage — affordable but far too slow for real-time inference.
Neither option scales efficiently for large, persistent AI contexts.
The hidden cost of today’s memory hierarchy
Conventional AI infrastructure is structured around a four-tier memory model:
G1: GPU HBM
G2: System RAM
G3: Local storage
G4: Shared or network-attached storage
As inference context spills from G1 into lower tiers, performance degrades sharply. When KV cache data reaches shared storage (G4), retrieval latencies jump into the millisecond range, forcing high-cost GPUs to idle while waiting for memory.
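The latency gap between tiers is easiest to see with a toy model. The figures below are rough order-of-magnitude assumptions, not measurements, but they show why a decoder that must reach G4 for every KV block spends most of its time waiting rather than generating.

```python
# Order-of-magnitude access latencies per tier (assumed, for illustration only).
TIER_LATENCY_S = {
    "G1_HBM": 1e-6,             # effectively immediate for the decoder
    "G2_SYSTEM_RAM": 10e-6,
    "G3_LOCAL_FLASH": 100e-6,
    "G4_SHARED_STORAGE": 5e-3,  # millisecond-range network storage
}

def fetch_stall(tier: str, block_bytes: int, bandwidth_gbps: float) -> float:
    """Rough time the GPU decoder waits for one KV block from a given tier."""
    transfer = block_bytes / (bandwidth_gbps * 1e9 / 8)
    return TIER_LATENCY_S[tier] + transfer

block = 2 * 1024 * 1024  # a 2 MiB KV block (illustrative)
for tier, bw in [("G1_HBM", 3000), ("G2_SYSTEM_RAM", 400),
                 ("G3_LOCAL_FLASH", 100), ("G4_SHARED_STORAGE", 25)]:
    print(f"{tier:18s} ~{fetch_stall(tier, block, bw) * 1e6:8.1f} µs per block")
```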
This inefficiency manifests in several ways:
Reduced tokens-per-second (TPS) throughput
Higher energy consumption per inference
Underutilised GPU resources
Inflated total cost of ownership (TCO)
The core issue is that KV cache is being treated like traditional enterprise data — when it is not.
KV cache is a new data class
Inference context differs fundamentally from durable enterprise data such as logs, records, or backups. KV cache is:
Derived, not authoritative
Latency-critical but short-lived
High-velocity and continuously updated
General-purpose storage systems are poorly suited to this workload. They waste power and compute cycles on durability, replication, and metadata management that agentic AI does not require.
This mismatch is now limiting the scalability of long-context AI systems.
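The contrast is easy to sketch. A store built for KV cache only needs bounded capacity and fast eviction, because dropped context can always be recomputed from the prompt; durability and replication are pure overhead. The class below is a minimal illustration of that behaviour, not a description of any vendor's implementation.

```python
from collections import OrderedDict

class EphemeralKVStore:
    """Minimal sketch of a store tuned for KV blocks: bounded capacity,
    LRU eviction, and no durability, replication or rich metadata."""

    def __init__(self, capacity_blocks: int):
        self.capacity = capacity_blocks
        self._blocks: OrderedDict[str, bytes] = OrderedDict()

    def put(self, block_id: str, data: bytes) -> None:
        self._blocks[block_id] = data
        self._blocks.move_to_end(block_id)
        while len(self._blocks) > self.capacity:
            self._blocks.popitem(last=False)  # silently drop the coldest block

    def get(self, block_id: str) -> bytes | None:
        if block_id in self._blocks:
            self._blocks.move_to_end(block_id)  # mark as recently used
            return self._blocks[block_id]
        return None  # caller recomputes the block instead of treating this as data loss
```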
NVIDIA ICMS: introducing a purpose-built memory tier
To address this growing gap, NVIDIA has introduced Inference Context Memory Storage (ICMS) as part of its upcoming Rubin architecture.
ICMS creates an entirely new memory tier — commonly referred to as G3.5 — positioned between system memory and shared storage. This tier is designed specifically for gigascale AI inference and the unique characteristics of KV cache.
Rather than relying on CPUs and generic storage protocols, ICMS integrates Ethernet-attached flash storage directly into the AI compute pod and offloads context management to the NVIDIA BlueField-4 Data Processing Unit (DPU).
This design allows agentic systems to retain massive working memory without consuming expensive GPU HBM.
Performance and efficiency gains
The benefits of ICMS are both practical and measurable.
By keeping active inference context in a low-latency flash tier close to the GPU, the system can pre-stage KV blocks back into HBM precisely when required. This minimises GPU decoder idle time and significantly improves throughput for long-context workloads.
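In principle, the pattern is simple overlap: while the GPU decodes one block of context, the next block is already being read back from the flash tier. The sketch below illustrates that idea with plain Python threads; the fetch_from_icms and decode_on_gpu callables are placeholders supplied by the caller, not part of any NVIDIA API.

```python
from concurrent.futures import ThreadPoolExecutor

def decode_with_prestaging(block_ids, fetch_from_icms, decode_on_gpu):
    """Overlap flash-tier reads with GPU decoding: while block i is being
    decoded, block i+1 is already being pulled back toward HBM."""
    with ThreadPoolExecutor(max_workers=1) as prefetcher:
        next_block = prefetcher.submit(fetch_from_icms, block_ids[0])
        for i, _ in enumerate(block_ids):
            block = next_block.result()  # waits only if the prefetch lagged
            if i + 1 < len(block_ids):
                next_block = prefetcher.submit(fetch_from_icms, block_ids[i + 1])
            decode_on_gpu(block)         # GPU works while the next read runs
```

The decoder only stalls when a read falls behind, which is exactly the idle time the pre-staging approach is designed to eliminate.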
According to NVIDIA, this approach can deliver:
Up to 5× higher tokens per second for large-context inference
Up to 5× better power efficiency compared to traditional storage paths
These gains come from eliminating unnecessary storage overhead and reducing idle compute time — not from increasing raw GPU count.
Storage networking becomes part of the compute fabric
Implementing ICMS requires a shift in how enterprises think about storage and networking.
The platform relies on NVIDIA Spectrum-X Ethernet, which provides high-bandwidth, low-latency, and low-jitter connectivity. This allows flash storage to behave more like extended memory than traditional block storage.
On the software side, orchestration frameworks play a critical role. Tools such as NVIDIA Dynamo and the Inference Transfer Library (NIXL) manage the movement of KV cache blocks across memory tiers in real time.
These systems ensure that inference context is located in the right tier — GPU memory, system RAM, or ICMS — exactly when the model needs it. The NVIDIA DOCA framework further supports this by treating KV cache as a first-class, network-managed resource.
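A simplified placement policy captures the intent of this tiering, even though the real frameworks are far more sophisticated. The function below is a toy illustration with arbitrary thresholds; it is not the Dynamo, NIXL or DOCA interface.

```python
def choose_tier(expected_reuse_s: float, block_bytes: int,
                hbm_free_bytes: int) -> str:
    """Toy placement policy: blocks needed imminently stay in HBM, blocks
    needed soon sit in system RAM or the ICMS flash tier, and cold context
    falls back to shared storage. Thresholds are illustrative only."""
    if expected_reuse_s < 0.05 and block_bytes <= hbm_free_bytes:
        return "G1_HBM"
    if expected_reuse_s < 1.0:
        return "G2_SYSTEM_RAM"
    if expected_reuse_s < 30.0:
        return "G3.5_ICMS"
    return "G4_SHARED_STORAGE"
```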
Industry adoption and ecosystem support
The ICMS architecture is gaining rapid industry alignment. Major infrastructure and storage vendors, including Dell Technologies, HPE, IBM, Nutanix, Pure Storage, Supermicro, VAST Data, WEKA, DDN, and others, are already developing ICMS-compatible platforms built around BlueField-4.
Commercial solutions are expected to reach the market in the second half of the year, making this architecture relevant for near-term enterprise planning.
Implications for enterprise infrastructure strategy
Adopting a dedicated AI context memory tier has significant implications for datacentre design and capacity planning.
1. Redefining data categories
CIOs and infrastructure leaders must recognise inference context as “ephemeral but latency-sensitive” data. Treating KV cache separately allows durable storage to focus on compliance, logging, and archival workloads.
2. Orchestration maturity becomes critical
Success depends on topology-aware scheduling. Platforms such as NVIDIA Grove place inference jobs close to their cached context, minimising cross-fabric traffic and latency.
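Conceptually, topology-aware placement is a scoring problem: prefer the node that already holds a session's cached context, then break ties on available HBM. The sketch below illustrates the idea with made-up node records; real schedulers such as Grove expose their own abstractions.

```python
def place_job(session_id: str, nodes: list[dict]) -> str:
    """Toy topology-aware placement: favour context locality first,
    then the node with the most free HBM."""
    def score(node):
        locality = 1 if session_id in node["cached_sessions"] else 0
        return (locality, node["free_hbm_gb"])
    return max(nodes, key=score)["name"]

nodes = [
    {"name": "pod-a", "cached_sessions": {"sess-42"}, "free_hbm_gb": 20},
    {"name": "pod-b", "cached_sessions": set(), "free_hbm_gb": 60},
]
print(place_job("sess-42", nodes))  # pod-a wins on context locality
```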
3. Higher compute density per rack
By reducing HBM pressure, ICMS allows more effective GPU utilisation within the same physical footprint. This extends the lifespan of existing facilities but increases power and cooling density requirements.
Redesigning the datacentre for agentic AI
The rise of agentic AI forces a rethinking of traditional datacentre assumptions. The established separation of compute from slow, persistent storage is no longer viable when AI systems must recall vast amounts of context in real time.
By inserting a specialised memory tier, enterprises can decouple AI memory growth from GPU cost, enabling multiple agents to share a massive, low-power context pool. The result is lower serving costs, higher throughput, and more scalable reasoning.
As organisations plan their next infrastructure investments, memory hierarchy optimisation will be just as important as GPU selection. ICMS signals that the future of AI scaling is not only about faster compute — but about smarter memory architecture.